In this project I am going to test different statistical learning methods in order to predict the price per night of an Airbnb apartment in the city of Barcelona. For this purpose I will use the following methods: Ridge Regression, Lasso Regression, Bagging, Random Forest and Gradient Boosting.

The data to train the model has been obtained from this project. On this website we can find datasets related to Airbnb apartments in several cities around the world. Each dataset refers to a single city, and since I am from Barcelona (and it was also one of the cities with the most data), I have decided to try to predict the price per night in my home city.
Since the dataset included a lot of variables, I have decided to load only those that seem most relevant.
This dataset contains 20864 rows (observations) and 60 columns, and as we can already see, some variables have very messy data, which will require a lot of data wrangling.
str(airbnb_data)
As a complement to the main dataset, I'll be using the neighbourhoods_geojson dataset, which provides geospatial information about Barcelona and its Airbnb apartments (such as longitude, latitude, polygons, etc.):
library(rgdal)
## Warning: package 'rgdal' was built under R version 4.0.2
## Warning: package 'sp' was built under R version 4.0.2
neighbourhoods_geojson <- rgdal::readOGR("C:/Users/marct/OneDrive - Tecnocampus Mataro-Maresme/Documentos/CURSOS/PROJECTES/AIRBNB PRICE PREDICTION/DATA/neighbourhoods.geojson")
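A note for reproducibility: the rgdal package was retired from CRAN in 2023, so on a current R installation the same file can be read with the sf package instead. This is a sketch, not part of the original pipeline, and the relative path is only a placeholder:

```r
# Hypothetical alternative: sf::st_read is the modern replacement for rgdal::readOGR.
# Adjust the path to wherever neighbourhoods.geojson is stored locally.
library(sf)
neighbourhoods_geojson <- st_read("neighbourhoods.geojson")
```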
First of all I load the lubridate package, which I use to convert the host_since variable from character to date format. Then I extract the year from the variable I have just converted and reorder the columns.
library(lubridate)
airbnb_data$host_since <- ymd(airbnb_data$host_since)
airbnb_data$host_since_year <- year(airbnb_data$host_since)
airbnb_data <- airbnb_data[, c(1, ncol(airbnb_data), 2 : (ncol(airbnb_data) - 1))]
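On a toy vector, the conversion and extraction above work like this (a minimal sketch with made-up dates):

```r
library(lubridate)
x <- c("2015-06-01", "2010-11-23")
ymd(x)        # parsed as Date objects
year(ymd(x))  # 2015 2010
```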
After that, I convert the neighbourhood column into a factor and realise that some levels are misspelled due to encoding errors (e.g. "Grà cia" instead of "Gràcia").
airbnb_data$neighbourhood <- airbnb_data$neighbourhood %>% as.factor()
levels(airbnb_data$neighbourhood)
## [1] "" "Camp d'en Grassot i Grà cia Nova"
## [3] "Can Baro" "Carmel"
## [5] "Ciutat Vella" "Diagonal Mar - La Mar Bella"
## [7] "Dreta de l'Eixample" "Eixample"
## [9] "El Baix Guinardó" "El Besòs i el Maresme"
## [11] "El Bon Pastor" "El Born"
## [13] "El Camp de l'Arpa del Clot" "El Clot"
## [15] "El Coll" "El Congrés i els Indians"
## [17] "el Fort Pienc" "El Gòtic"
## [19] "El Poble-sec" "El Poblenou"
## [21] "El Putget i Farró" "El Raval"
## [23] "Glòries - El Parc" "Grà cia"
## [25] "Guinardó" "Horta"
## [27] "Horta-Guinardó" "L'Antiga Esquerra de l'Eixample"
## [29] "La Barceloneta" "La Font d'en Fargues"
## [31] "La Guineueta - Canyelles" "La Maternitat i Sant Ramon"
## [33] "La Nova Esquerra de l'Eixample" "La Prosperitat"
## [35] "La Sagrada FamÃlia" "La Sagrera"
## [37] "La Salut" "La Teixonera"
## [39] "La Trinitat Vella" "La Vall d'Hebron"
## [41] "La Verneda i La Pau" "La Vila OlÃmpica"
## [43] "Les Corts" "Les Tres Torres"
## [45] "Montbau" "Navas"
## [47] "Nou Barris" "Pedralbes"
## [49] "Porta" "Provençals del Poblenou"
## [51] "Sant Andreu" "Sant Andreu de Palomar"
## [53] "Sant Antoni" "Sant GenÃs dels Agudells"
## [55] "Sant Gervasi - Galvany" "Sant Gervasi - la Bonanova"
## [57] "Sant MartÃ" "Sant Martà de Provençals"
## [59] "Sant Pere/Santa Caterina" "Sants-Montjuïc"
## [61] "Sarrià " "Sarrià -Sant Gervasi"
## [63] "Torre Baró" "Trinitat Nova"
## [65] "Turó de la Peira - Can Peguera" "Vallcarca i els Penitents"
## [67] "Verdum - Los Roquetes" "Vila de Grà cia"
## [69] "Vilapicina i la Torre Llobeta"
Since I am Catalan, I am able to restore the correct spellings.
levels(airbnb_data$neighbourhood) <- c("" , "Camp d'en Grassot i Gràcia Nova", "Can Baró",
"Carmel" , "Ciutat Vella", "Diagonal Mar - La Mar Bella" ,
"Dreta de l'Eixample" , "Eixample" , "El Baix Guinardó",
"El Besós i el Maresme" , "El Bon Pastor" , "El Born" ,
"El Camp de l'Arpa del Clot", "El Clot" , "El Coll" ,
"El Congrés i els Indians" , "El Fort Pienc" , "El Gòtic" ,
"El Poble-sec" , "El Poblenou" , "El Putget i Farró" ,
"El Raval" , "Glòries - El Parc" , "Gràcia",
"Guinardó" , "Horta" , "Horta-Guinardó" ,
"L'Antiga Esquerra de l'Eixample", "La Barceloneta" , "La Font d'en Fargues" ,
"La Guineueta - Canyelles" , "La Maternitat i Sant Ramon" , "La Nova Esquerra de l'Eixample",
"La Prosperitat" , "La Sagrada Família", "La Sagrera" ,
"La Salut" , "La Teixonera", "La Trinitat Vella" ,
"La Vall d'Hebron" , "La Verneda i La Pau" , "La Vila Olímpica" ,
"Les Corts" , "Les Tres Torres" , "Montbau" ,
"Navas" , "Nou Barris" , "Pedralbes" ,
"Porta" , "Provençals del Poblenou" , "Sant Andreu" ,
"Sant Andreu de Palomar" , "Sant Antoni" , "Sant Genís dels Agudells" ,
"Sant Gervasi - Galvany" , "Sant Gervasi - la Bonanova", "Sant Martí" ,
"Sant Martí de Provençals", "Sant Pere/Santa Caterina", "Sants-Montjuïc" ,
"Sarrià " , "Sarrià -Sant Gervasi" , "Torre Baró" ,
"Trinitat Nova" , "Turó de la Peira - Can Peguera", "Vallcarca i els Penitents" ,
"Verdum - Los Roquetes" , "Vila de Gràcia" , "Vilapicina i la Torre Llobeta")
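Assigning to levels() renames the categories by position, so the replacement vector must follow the exact order returned by levels(airbnb_data$neighbourhood). A toy example of the mechanism:

```r
# Toy factor with a garbled level; levels are stored in sorted order
f <- factor(c("Grà cia", "Can Baro", "Grà cia"))
levels(f)                             # "Can Baro" "Grà cia"
levels(f) <- c("Can Baró", "Gràcia")  # rename by position
f  # Gràcia Can Baró Gràcia (the data is recoded, counts are preserved)
```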
These variables also need to be treated, since they contain special characters such as % or $, or have to be coded as boolean factors. Here we can see some of the values that I am about to modify:
airbnb_data[1, c(4,5)]
## host_response_rate host_acceptance_rate
## 1 100% 91%
airbnb_data[1, c(30, 31, 32, 34)]
## price security_deposit cleaning_fee extra_people
## 1 $80.00 $100.00 $20.00 $0.00
airbnb_data[1, c(6, 11, 12, 20, 37, 51, 52, 53, 55, 56)]
## host_is_superhost host_has_profile_pic host_identity_verified
## 1 f t t
## is_location_exact has_availability requires_license instant_bookable
## 1 t t t f
## is_business_travel_ready require_guest_profile_picture
## 1 f f
## require_guest_phone_verification
## 1 f
Then I treat the previous variables that had special characters such as "%" or "$" using simple loops, and convert them into numeric variables. After that I substitute the "t" and "f" values of the boolean variables with the strings TRUE and FALSE (respectively) in the last loop.
for(i in c(4, 5)){
airbnb_data[, i] <- as.numeric(gsub("%", "", airbnb_data[, i])) / 100
}
for(i in c(30, 31, 32, 34)){
airbnb_data[, i] <- as.numeric(gsub("\\$", "", airbnb_data[, i]))
}
for(i in c(6, 11, 12, 20, 37, 51, 52, 53, 55, 56)){
airbnb_data[, i] <- gsub("^t$", "TRUE", airbnb_data[, i]) # anchored so only whole "t"/"f" values match,
airbnb_data[, i] <- gsub("^f$", "FALSE", airbnb_data[, i]) # not letters inside longer strings
}
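On single values, these cleaning steps behave like this (a small sketch; note that as.numeric would still fail on values with thousands separators such as "$1,000.00", which do not seem to occur here). The t/f patterns are anchored with ^ and $ so that only whole values are replaced:

```r
as.numeric(gsub("%", "", "91%")) / 100  # 0.91
as.numeric(gsub("\\$", "", "$80.00"))   # 80
gsub("^t$", "TRUE", gsub("^f$", "FALSE", c("t", "f")))  # "TRUE" "FALSE"
```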
Now I create a list that I will use to spot strange values and NA's, by converting all the variables into factors and copying the levels of the ith variable into the ith element of a list. I use the jsonedit function from the listviewer package to create an interactive, easy-to-explore list.
airbnb_data_factor <- airbnb_data
factor_list <- list()
for(i in 1:ncol(airbnb_data_factor)){
airbnb_data_factor[, i] <- as.factor(airbnb_data_factor[, i])
}
factor_list <- lapply(airbnb_data_factor, levels) # extract the levels of every column once, outside the loop
library(listviewer)
jsonedit(factor_list)
After exploring this list I observe that the host_verifications and amenities columns need some treatment before they can be used in any statistical model (I need to create dummy variables from their values). This is an example of the format of these columns' values.
airbnb_data$host_verifications[1]
## [1] "['email', 'phone', 'reviews', 'manual_offline', 'jumio', 'offline_government_id', 'government_id', 'work_email']"
airbnb_data$amenities[1]
## [1] "{TV,\"Cable TV\",Internet,Wifi,\"Air conditioning\",Kitchen,Elevator,Heating,\"Family/kid friendly\",Washer,Dryer,\"Smoke alarm\",\"Carbon monoxide alarm\",\"Fire extinguisher\",Essentials,\"translation missing: en.hosting_amenity_49\"}"
In the following chunk I create two dataframes, after deleting the strange or useless characters that I spotted using the previous list.
library(mgsub)
library(stringr)
host_verifications <- as.data.frame(mgsub(airbnb_data$host_verifications, c("\\[", "\\]", "\\,", "\\'") , c("", "", "", "")))
amenities <- as.data.frame(mgsub(airbnb_data$amenities, c("\\{", "\\}", "\\,", "\\'", "\\[", "\\]", "\"", "\\/") , c("", "", " ", "", "", "", "", "")))
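mgsub applies several pattern/replacement pairs in one call. On a shortened version of the first host_verifications value, the cleaning works like this:

```r
library(mgsub)
x <- "['email', 'phone']"
mgsub(x, c("\\[", "\\]", "\\,", "\\'"), c("", "", "", ""))
# "email phone"
```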
Now I create two lists for each variable, which I will use in the following loop. Since both variables (host_verifications and amenities) are dataframes with the same number of rows, I create a single loop iterating over the rows of host_verifications. The loop starts by splitting the ith row of the host_verifications dataframe (and since str_split returns a list with only one element, I subset the first element) and saves the resulting vector into the ith element of output_list_verifications. Then it counts the number of elements of that vector and stores the result in the ith element of output_list_verifications_count, which I will use later to find the vector with the most components and create the dummy variables. The same is done for amenities.
output_list_verifications <- list()
output_list_verifications_count <- list()
output_list_amenities <- list()
output_list_amenities_count <- list()
for(i in 1:nrow(host_verifications)){
output_list_verifications[i] <- str_split(host_verifications[i, ], " ")[1]
output_list_verifications_count[i] <- length(output_list_verifications[[i]])
output_list_amenities[i] <- str_split(amenities[i, ], " ")[1]
output_list_amenities_count[i] <- length(output_list_amenities[[i]])
}
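The [1] subsetting matters because str_split always returns a list, even for a single input string (a quick sketch):

```r
library(stringr)
str_split("email phone reviews", " ")       # a list of length 1
str_split("email phone reviews", " ")[[1]]  # c("email", "phone", "reviews")
length(str_split("email phone reviews", " ")[[1]])  # 3
```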
Now I extract the longest vector (the one containing the most values) from the output_list_verifications list, by picking the element with the highest value in output_list_verifications_count, and save the result in dummy_host_verifications_cols. After that I create an empty dataframe called dummy_host_verifications_df (with the same number of rows as the original airbnb_data dataset and one column per element of dummy_host_verifications_cols), and I repeat the same process for the amenities variable. Below you can see dummy_host_verifications_df.
dummy_host_verifications_cols <- output_list_verifications[[which.max(output_list_verifications_count)]]
dummy_host_verifications_df <- matrix(nrow = nrow(airbnb_data), ncol = length(dummy_host_verifications_cols)) %>% as.data.frame()
colnames(dummy_host_verifications_df) <- dummy_host_verifications_cols
dummy_amenities_cols <- output_list_amenities[[which.max(output_list_amenities_count)]]
dummy_amenities_df <- matrix(nrow = nrow(airbnb_data), ncol = length(dummy_amenities_cols)) %>% as.data.frame()
colnames(dummy_amenities_df) <- dummy_amenities_cols
head(dummy_host_verifications_df)
## email phone facebook reviews manual_offline jumio offline_government_id
## 1 NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## sent_id selfie government_id identity_manual work_email
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
Finally I bind the new dataframes (dummy_host_verifications_df and dummy_amenities_df) to the original dataset (airbnb_data).
airbnb_data <- cbind(airbnb_data, dummy_host_verifications_df, dummy_amenities_df)
Now I substitute the NA's with "Yes" if the apartment has that verification or amenity (the remaining NA's will become "No" later). In this case I use a loop nested inside another loop (one pair of loops per list/variable): the outer loop iterates over the length of output_list_verifications / output_list_amenities and the inner one over the dummy columns I have just created (whose positions I specified manually). If the name of the jth column appears in the lth vector of the output list, then the value in the lth row and jth column of airbnb_data is set to "Yes".
for(l in 1:length(output_list_verifications)){
for(j in 62:74){
if(colnames(airbnb_data)[j] %in% output_list_verifications[[l]]){
airbnb_data[l, j] <- "Yes"
}
}
}
for(l in 1:length(output_list_amenities)){
for(j in 75:ncol(airbnb_data)){
if(colnames(airbnb_data)[j] %in% output_list_amenities[[l]]){
airbnb_data[l, j] <- "Yes"
}
}
}
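The nested loops above can also be expressed without explicit iteration, using %in% over each listing's vector (a sketch with made-up amenity lists; NA plays the role of the "not present" marker, as in the original data):

```r
# Hypothetical mini example: one row per listing, one column per amenity
lists <- list(c("TV", "Wifi"), c("Wifi", "Kitchen"))
cols  <- c("TV", "Wifi", "Kitchen")
dummies <- t(sapply(lists, function(v) ifelse(cols %in% v, "Yes", NA)))
colnames(dummies) <- cols
dummies
# row 1: Yes Yes NA ; row 2: NA Yes Yes
```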
Now I replace the remaining NA's with "No" and convert all the new columns into factors:
for(j in 62:ncol(airbnb_data)){
for(i in 1:nrow(airbnb_data)){
if(is.na(airbnb_data[i, j]) == TRUE){
airbnb_data[i, j] <- "No"
}
}
airbnb_data[, j] <- factor(airbnb_data[, j], levels = c("Yes", "No"))
}
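The NA-to-"No" replacement can also be done per column without the inner row loop, since logical subsetting with is.na is vectorised (a sketch):

```r
x <- c("Yes", NA, "Yes", NA)
x[is.na(x)] <- "No"
factor(x, levels = c("Yes", "No"))
# [1] Yes No  Yes No
```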
I delete some variables that are no longer useful, and create new ones: days_since_host, days_since_first_review and days_since_last_review, which will allow me to introduce the time component into the model, since the statistical methods I am about to use don't deal well with dates and times.
airbnb_data$amenities <- NULL
airbnb_data$host_verifications <- NULL
airbnb_data$has_availability <- NULL
airbnb_data$is_business_travel_ready <- NULL
airbnb_data$days_since_host <- (today() - as.Date(airbnb_data$host_since)) %>% as.numeric()
airbnb_data$days_since_first_review <- (today() - as.Date(airbnb_data$first_review)) %>% as.numeric()
airbnb_data$days_since_last_review <- (today() - as.Date(airbnb_data$last_review)) %>% as.numeric()
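Subtracting dates in R yields a difftime object, which as.numeric turns into a plain number of days. With fixed, made-up dates:

```r
library(lubridate)
d <- ymd("2020-03-01") - ymd("2020-01-01")
d              # Time difference of 60 days
as.numeric(d)  # 60
```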
airbnb_data$host_since <- NULL
airbnb_data$first_review <- NULL
airbnb_data$last_review <- NULL
airbnb_data$host_response_time <- NULL
airbnb_data$host_listings_count <- NULL
Now I drop the rows with NA's in the host_since_year, country and city columns, which I will use to clean the other variables by substituting NA's and strange values with the median (in the case of numerical variables) or the mode (in the case of categorical ones).
airbnb_data <- airbnb_data[complete.cases(airbnb_data$host_since_year) & complete.cases(airbnb_data$country) & complete.cases(airbnb_data$city), ]
I use the same list as before to spot NA’s, missing values and strange values:
airbnb_data_factor <- airbnb_data
for(i in 1:ncol(airbnb_data_factor)){
airbnb_data_factor[, i] <- as.factor(airbnb_data_factor[, i])
}
factor_list <- lapply(airbnb_data_factor, levels)
library(listviewer)
jsonedit(factor_list)
The strange values are: "N/A", "-", "*", "[no name]", ".", "" (the empty string) and NA. Firstly, I create a loop nested inside another loop, iterating over columns (j), rows (i) and patterns (k). If a non-NA value in the ith row and jth column equals any of the k patterns, it is converted to NA. This conversion will come in handy in the following code chunks.
patterns <- c("N/A", "-", "*", "[no name]", ".", "")
for(j in 1:ncol(airbnb_data)){
for(i in 1:nrow(airbnb_data)){
for(k in 1:length(patterns)){
if(!is.na(airbnb_data[i, j]) && airbnb_data[i, j] == patterns[k]){ # check is.na first, so && short-circuits before the comparison can return NA
airbnb_data[i, j] <- NA
}
}
}
}
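Because %in% treats NA safely (it simply returns FALSE for it), the triple loop can be collapsed into one vectorised pass per column, e.g.:

```r
patterns <- c("N/A", "-", "*", "[no name]", ".", "")
x <- c("ok", "N/A", "", "fine", NA, "-")
replace(x, x %in% patterns, NA)
# "ok" NA NA "fine" NA NA
```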
I load the modeest package in order to apply the mlv function (which calculates the mode of a vector), and also the stringr library, which I'll use later. In this loop I iterate over columns (j) and rows (i). If the element in the ith row and jth column is NA and the column is numeric, it is substituted by the median of the jth variable (after filtering to the same year, country and city values as the ith row). If it is NA and the column is character or factor, it is substituted by the mode of the jth variable (again after filtering to the same year, country and city as the ith row).
library(modeest)
library(stringr)
for(j in 1:ncol(airbnb_data)){
for(i in 1:nrow(airbnb_data)){
if(is.na(airbnb_data[i, j]) == TRUE){
if(is.numeric(airbnb_data[, j]) == TRUE){
airbnb_data[i, j] = median(airbnb_data[complete.cases(airbnb_data) & airbnb_data$host_since_year == airbnb_data[i, "host_since_year"]
& airbnb_data$country == airbnb_data[i, "country"] & airbnb_data$city == airbnb_data[i, "city"] , j])
} else if(is.character(airbnb_data[, j]) == TRUE | is.factor(airbnb_data[, j]) == TRUE){
airbnb_data[i, j] = mlv(airbnb_data[complete.cases(airbnb_data) & airbnb_data$host_since_year == airbnb_data[i, "host_since_year"]
& airbnb_data$country == airbnb_data[i, "country"] & airbnb_data$city == airbnb_data[i, "city"], j])[1]
}
}
}
}
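For categorical columns, the mode can also be computed with a small helper built on table, which is easy to sanity-check (an alternative to modeest::mlv, not the function used above):

```r
# Most frequent value; table() drops NA's by default,
# and which.max picks the first level in case of ties
stat_mode <- function(x) names(which.max(table(x)))
stat_mode(c("Apartment", "Loft", "Apartment", "House"))  # "Apartment"
```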
Then I filter out any NA's that may be left, and substitute the " " spaces in the column names with "_" using a simple loop. Finally I delete columns that are no longer useful (mainly factor columns containing a single level), and convert the neighbourhood variable into a factor.
airbnb_data <- airbnb_data[complete.cases(airbnb_data), ]
for(i in 1:ncol(airbnb_data)){
colnames(airbnb_data)[i] <- gsub(" ", "_", colnames(airbnb_data)[i])
}
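Since gsub is vectorised over its input, the renaming loop above is equivalent to a single call:

```r
nms <- c("Cable TV", "Hot water", "price")
gsub(" ", "_", nms)
# "Cable_TV" "Hot_water" "price"
```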
airbnb_data <- airbnb_data[, !duplicated(colnames(airbnb_data))] # we delete duplicated columns (because some dummy variables had very similar names and were detected by R as identical)
airbnb_data <- airbnb_data[airbnb_data$city == "Barcelona", ] # I filter out possible other cities (since there was a row containing a different city value)
airbnb_data$country <- NULL
airbnb_data$state <- NULL
airbnb_data$smart_location <- NULL
airbnb_data$city <- NULL # since there's only one city (Barcelona)
airbnb_data$requires_license <- NULL # since it is a factor with one level
airbnb_data$neighbourhood <- as.factor(airbnb_data$neighbourhood)
First of all I check that there are no NA’s left in the data:
airbnb_data[is.na(airbnb_data), ] %>% nrow()
## [1] 0
Now I use str() and summary() to check the structure of the data again:
str(airbnb_data)
## 'data.frame': 7364 obs. of 156 variables:
## $ host_since_year : num 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ host_response_rate : num 1 1 1 1 1 1 1 1 1 1 ...
## $ host_acceptance_rate : num 0.91 0.91 0.91 1 1 1 1 0.77 1 0.55 ...
## $ host_is_superhost : chr "FALSE" "FALSE" "FALSE" "TRUE" ...
## $ host_neighbourhood : chr "El Gòtic" "El Gòtic" "El Gòtic" "L'Antiga Esquerra de l'Eixample" ...
## $ host_total_listings_count : int 3 3 3 4 4 4 4 1 7 7 ...
## $ host_has_profile_pic : chr "TRUE" "TRUE" "TRUE" "TRUE" ...
## $ host_identity_verified : chr "TRUE" "TRUE" "TRUE" "TRUE" ...
## $ neighbourhood : Factor w/ 69 levels "","Camp d'en Grassot i Gràcia Nova",..: 18 18 5 8 28 28 28 5 8 2 ...
## $ latitude : num 41.4 41.4 41.4 41.4 41.4 ...
## $ longitude : num 2.18 2.18 2.18 2.15 2.15 ...
## $ is_location_exact : chr "TRUE" "TRUE" "TRUE" "TRUE" ...
## $ property_type : chr "Apartment" "Apartment" "Apartment" "Apartment" ...
## $ room_type : chr "Private room" "Private room" "Entire home/apt" "Private room" ...
## $ accommodates : int 2 2 9 2 1 1 2 1 4 6 ...
## $ bathrooms : num 1 1 3 2 2 2 2 1 1 2 ...
## $ bedrooms : int 1 1 4 1 1 1 1 1 1 2 ...
## $ beds : int 1 1 6 1 1 1 1 1 2 4 ...
## $ bed_type : chr "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
## $ square_feet : num 807 807 807 807 807 807 807 807 807 807 ...
## $ price : num 80 100 227 40 30 30 45 33 130 110 ...
## $ security_deposit : num 100 150 200 0 0 0 0 0 150 500 ...
## $ cleaning_fee : num 20 40 67 6 6 6 6 15 59 30 ...
## $ guests_included : int 2 1 4 1 1 1 1 1 2 3 ...
## $ extra_people : num 0 0 25 10 0 0 10 0 10 25 ...
## $ minimum_nights : int 3 5 4 7 2 2 2 2 2 3 ...
## $ maximum_maximum_nights : int 90 120 1125 1125 730 1125 1125 65 364 365 ...
## $ availability_30 : int 16 30 30 26 14 21 17 29 30 30 ...
## $ availability_60 : int 16 60 60 52 31 44 43 59 60 60 ...
## $ availability_90 : int 16 90 90 75 46 70 69 83 80 90 ...
## $ availability_365 : int 88 180 180 348 318 345 344 358 336 365 ...
## $ number_of_reviews : int 2 8 149 303 238 258 222 73 339 39 ...
## $ review_scores_rating : num 100 68 91 94 95 96 95 94 94 88 ...
## $ review_scores_accuracy : num 10 8 10 10 10 10 10 10 10 9 ...
## $ review_scores_cleanliness : num 10 8 9 9 10 10 9 10 10 9 ...
## $ review_scores_checkin : num 10 7 10 10 10 10 10 10 9 9 ...
## $ review_scores_communication : num 10 9 10 10 10 10 10 10 10 9 ...
## $ review_scores_value : num 10 7 9 10 10 9 9 10 10 9 ...
## $ instant_bookable : chr "FALSE" "FALSE" "TRUE" "TRUE" ...
## $ cancellation_policy : chr "moderate" "moderate" "moderate" "moderate" ...
## $ require_guest_profile_picture : chr "FALSE" "FALSE" "FALSE" "FALSE" ...
## $ require_guest_phone_verification : chr "FALSE" "FALSE" "FALSE" "FALSE" ...
## $ calculated_host_listings_count : int 3 3 3 4 4 4 4 1 3 4 ...
## $ calculated_host_listings_count_entire_homes : int 1 1 1 0 0 0 0 0 3 4 ...
## $ calculated_host_listings_count_private_rooms: int 2 2 2 4 4 4 4 1 0 0 ...
## $ calculated_host_listings_count_shared_rooms : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reviews_per_month : num 0.05 0.07 1.26 3.01 2.16 2.62 3.17 0.69 3.09 0.4 ...
## $ email : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ phone : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ facebook : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 1 1 1 ...
## $ reviews : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ manual_offline : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 1 2 2 ...
## $ jumio : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ offline_government_id : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 2 2 2 ...
## $ sent_id : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ selfie : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ government_id : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ identity_manual : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ work_email : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 1 2 ...
## $ TV : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Cable_TV : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 2 2 ...
## $ Internet : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ Wifi : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ Air_conditioning : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 1 1 1 ...
## $ Wheelchair_accessible : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 1 ...
## $ Kitchen : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ Paid_parking_off_premises : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 1 2 ...
## $ Elevator : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 1 ...
## $ Buzzerwireless_intercom : Factor w/ 2 levels "Yes","No": 2 2 1 1 2 2 1 1 1 1 ...
## $ Heating : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ Familykid_friendly : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 1 1 ...
## $ Washer : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 2 1 1 ...
## $ Smoke_alarm : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 2 2 2 ...
## $ Carbon_monoxide_alarm : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 2 2 2 ...
## $ First_aid_kit : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
## $ Safety_card : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
## $ Fire_extinguisher : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 2 2 2 ...
## $ Essentials : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ Shampoo : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 1 2 2 ...
## $ Hangers : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 1 ...
## $ Hair_dryer : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 1 ...
## $ Iron : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 1 ...
## $ Laptop-friendly_workspace : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 1 2 2 ...
## $ Private_entrance : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
## $ Outlet_covers : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ High_chair : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 1 2 ...
## $ Stair_gates : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Children’s_books_and_toys : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Window_guards : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Table_corner_guards : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Babysitter_recommendations : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Crib : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 1 2 ...
## $ Pack_’n_Playtravel_crib : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 1 2 ...
## $ Room-darkening_shades : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 1 2 2 ...
## $ Children’s_dinnerware : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Hot_water : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 2 1 2 ...
## $ Body_soap : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Bath_towel : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Toilet_paper : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## [list output truncated]
summary(airbnb_data)
## host_since_year host_response_rate host_acceptance_rate host_is_superhost
## Min. :2009 Min. :0.0000 Min. :0.0000 Length:7364
## 1st Qu.:2012 1st Qu.:1.0000 1st Qu.:0.8900 Class :character
## Median :2013 Median :1.0000 Median :0.9800 Mode :character
## Mean :2013 Mean :0.9348 Mean :0.8748
## 3rd Qu.:2013 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2014 Max. :1.0000 Max. :1.0000
##
## host_neighbourhood host_total_listings_count host_has_profile_pic
## Length:7364 Min. : 0.00 Length:7364
## Class :character 1st Qu.: 1.00 Class :character
## Mode :character Median : 3.00 Mode :character
## Mean : 15.68
## 3rd Qu.: 17.00
## Max. :170.00
##
## host_identity_verified neighbourhood latitude
## Length:7364 Eixample :1607 Min. :41.35
## Class :character Ciutat Vella : 796 1st Qu.:41.38
## Mode :character Sants-Montjuïc : 666 Median :41.39
## Dreta de l'Eixample: 400 Mean :41.39
## Gràcia : 393 3rd Qu.:41.40
## Sant Martí : 374 Max. :41.46
## (Other) :3128
## longitude is_location_exact property_type room_type
## Min. :2.089 Length:7364 Length:7364 Length:7364
## 1st Qu.:2.157 Class :character Class :character Class :character
## Median :2.167 Mode :character Mode :character Mode :character
## Mean :2.167
## 3rd Qu.:2.177
## Max. :2.222
##
## accommodates bathrooms bedrooms beds
## Min. : 1.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 2.000
## Mean : 3.789 Mean : 1.393 Mean : 1.769 Mean : 2.574
## 3rd Qu.: 5.000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :20.000 Max. :16.000 Max. :16.000 Max. :40.000
##
## bed_type square_feet price security_deposit
## Length:7364 Min. : 0.0 Min. : 10.0 Min. : 0.0
## Class :character 1st Qu.: 180.0 1st Qu.: 40.0 1st Qu.:100.0
## Mode :character Median : 431.0 Median : 70.0 Median :160.0
## Mean : 358.8 Mean :101.5 Mean :193.2
## 3rd Qu.: 538.0 3rd Qu.:120.0 3rd Qu.:250.0
## Max. :2799.0 Max. :999.0 Max. :999.0
##
## cleaning_fee guests_included extra_people minimum_nights
## Min. : 0.00 Min. : 1.000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 20.00 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 40.00 Median : 1.000 Median : 0.00 Median : 3.00
## Mean : 50.52 Mean : 2.087 Mean : 9.84 Mean : 10.81
## 3rd Qu.: 65.00 3rd Qu.: 2.000 3rd Qu.: 18.00 3rd Qu.: 10.00
## Max. :540.00 Max. :150.000 Max. :268.00 Max. :1124.00
##
## maximum_maximum_nights availability_30 availability_60 availability_90
## Min. :1.000e+00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:3.000e+02 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.:13.00
## Median :1.125e+03 Median :26.00 Median :49.00 Median :74.00
## Mean :5.840e+05 Mean :18.25 Mean :37.25 Mean :57.19
## 3rd Qu.:1.125e+03 3rd Qu.:30.00 3rd Qu.:59.00 3rd Qu.:89.00
## Max. :2.147e+09 Max. :30.00 Max. :60.00 Max. :90.00
##
## availability_365 number_of_reviews review_scores_rating review_scores_accuracy
## Min. : 0.0 Min. : 0.00 Min. : 20.00 Min. : 2.000
## 1st Qu.: 89.0 1st Qu.: 1.00 1st Qu.: 90.00 1st Qu.: 9.000
## Median :209.0 Median : 15.00 Median : 94.00 Median :10.000
## Mean :206.7 Mean : 56.84 Mean : 92.14 Mean : 9.517
## 3rd Qu.:346.0 3rd Qu.: 78.00 3rd Qu.: 96.00 3rd Qu.:10.000
## Max. :365.0 Max. :731.00 Max. :100.00 Max. :10.000
##
## review_scores_cleanliness review_scores_checkin review_scores_communication
## Min. : 2.000 Min. : 2.000 Min. : 2.000
## 1st Qu.: 9.000 1st Qu.:10.000 1st Qu.:10.000
## Median : 9.000 Median :10.000 Median :10.000
## Mean : 9.335 Mean : 9.699 Mean : 9.695
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
##
## review_scores_value instant_bookable cancellation_policy
## Min. : 2.000 Length:7364 Length:7364
## 1st Qu.: 9.000 Class :character Class :character
## Median : 9.000 Mode :character Mode :character
## Mean : 9.073
## 3rd Qu.:10.000
## Max. :10.000
##
## require_guest_profile_picture require_guest_phone_verification
## Length:7364 Length:7364
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## calculated_host_listings_count calculated_host_listings_count_entire_homes
## Min. : 1.00 Min. : 0.00
## 1st Qu.: 1.00 1st Qu.: 0.00
## Median : 3.00 Median : 1.00
## Mean : 15.44 Mean : 12.86
## 3rd Qu.: 18.00 3rd Qu.: 14.00
## Max. :132.00 Max. :132.00
##
## calculated_host_listings_count_private_rooms
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 2.227
## 3rd Qu.: 1.000
## Max. :81.000
##
## calculated_host_listings_count_shared_rooms reviews_per_month email
## Min. :0.00000 Min. : 0.010 Yes:7180
## 1st Qu.:0.00000 1st Qu.: 0.340 No : 184
## Median :0.00000 Median : 0.800
## Mean :0.02879 Mean : 1.214
## 3rd Qu.:0.00000 3rd Qu.: 1.660
## Max. :8.00000 Max. :25.450
##
## phone facebook reviews manual_offline jumio
## Yes:7354 Yes:1673 Yes:6945 Yes: 433 Yes:5709
## No : 10 No :5691 No : 419 No :6931 No :1655
##
##
##
##
##
## offline_government_id sent_id selfie government_id identity_manual
## Yes:3275 Yes: 117 Yes:2249 Yes:6076 Yes:1986
## No :4089 No :7247 No :5115 No :1288 No :5378
##
##
##
##
##
## work_email TV Cable_TV Internet Wifi Air_conditioning
## Yes:1452 Yes: 0 Yes: 908 Yes:2675 Yes:7218 Yes:4949
## No :5912 No :7364 No :6456 No :4689 No : 146 No :2415
##
##
##
##
##
## Wheelchair_accessible Kitchen Paid_parking_off_premises Elevator
## Yes: 521 Yes:6741 Yes:2520 Yes:4354
## No :6843 No : 623 No :4844 No :3010
##
##
##
##
##
## Buzzerwireless_intercom Heating Familykid_friendly Washer Smoke_alarm
## Yes:2393 Yes:6264 Yes:3631 Yes:6194 Yes:1732
## No :4971 No :1100 No :3733 No :1170 No :5632
##
##
##
##
##
## Carbon_monoxide_alarm First_aid_kit Safety_card Fire_extinguisher Essentials
## Yes:1324 Yes:1617 Yes: 525 Yes:1641 Yes:6909
## No :6040 No :5747 No :6839 No :5723 No : 455
##
##
##
##
##
## Shampoo Hangers Hair_dryer Iron Laptop-friendly_workspace
## Yes:4480 Yes:6153 Yes:6009 Yes:5802 Yes:4424
## No :2884 No :1211 No :1355 No :1562 No :2940
##
##
##
##
##
## Private_entrance Outlet_covers High_chair Stair_gates
## Yes:1005 Yes: 80 Yes:1519 Yes: 20
## No :6359 No :7284 No :5845 No :7344
##
##
##
##
##
## Childrenâ\200\231s_books_and_toys Window_guards Table_corner_guards
## Yes: 512 Yes: 113 Yes: 22
## No :6852 No :7251 No :7342
##
##
##
##
##
## Babysitter_recommendations Crib Pack_â\200\231n_Playtravel_crib
## Yes: 319 Yes:1756 Yes: 975
## No :7045 No :5608 No :6389
##
##
##
##
##
## Room-darkening_shades Childrenâ\200\231s_dinnerware Hot_water Body_soap Bath_towel
## Yes: 887 Yes: 179 Yes:5115 Yes: 159 Yes: 159
## No :6477 No :7185 No :2249 No :7205 No :7205
##
##
##
##
##
## Toilet_paper Bed_linens Extra_pillows_and_blankets Ethernet_connection
## Yes: 159 Yes:3544 Yes:2090 Yes: 416
## No :7205 No :3820 No :5274 No :6948
##
##
##
##
##
## Pocket_wifi Microwave Coffee_maker Refrigerator Dishwasher
## Yes: 469 Yes:3995 Yes:3996 Yes:4367 Yes:2066
## No :6895 No :3369 No :3368 No :2997 No :5298
##
##
##
##
##
## Dishes_and_silverware Cooking_basics Oven Stove Single_level_home
## Yes:4370 Yes:3732 Yes:3255 Yes:3268 Yes: 408
## No :2994 No :3632 No :4109 No :4096 No :6956
##
##
##
##
##
## Patio_or_balcony Luggage_dropoff_allowed Wide_hallways
## Yes:2504 Yes:1778 Yes: 119
## No :4860 No :5586 No :7245
##
##
##
##
##
## No_stairs_or_steps_to_enter Wide_entrance_for_guests
## Yes: 220 Yes: 123
## No :7144 No :7241
##
##
##
##
##
## Flat_path_to_guest_entrance Well-lit_path_to_entrance Extra_space_around_bed
## Yes: 104 Yes: 188 Yes: 57
## No :7260 No :7176 No :7307
##
##
##
##
##
## Accessible-height_bed Accessible-height_toilet Wide_clearance_to_shower
## Yes: 59 Yes: 31 Yes: 10
## No :7305 No :7333 No :7354
##
##
##
##
##
## _toilet Wide_entryway Host_greets_you Waterfront Beachfront
## Yes: 10 Yes: 45 Yes:3285 Yes: 169 Yes: 133
## No :7354 No :7319 No :4079 No :7195 No :7231
##
##
##
##
##
## Handheld_shower_head Hot_water_kettle Ceiling_fan Beach_view Rain_shower
## Yes: 38 Yes: 68 Yes: 12 Yes: 2 Yes: 38
## No :7326 No :7296 No :7352 No :7362 No :7326
##
##
##
##
##
## Bidet Heated_towel_rack Balcony Printer Espresso_machine
## Yes: 26 Yes: 30 Yes: 71 Yes: 7 Yes: 85
## No :7338 No :7334 No :7293 No :7357 No :7279
##
##
##
##
##
## Formal_dining_area Day_bed Convection_oven Standing_valet
## Yes: 34 Yes: 19 Yes: 40 Yes: 12
## No :7330 No :7345 No :7324 No :7352
##
##
##
##
##
## Pillow-top_mattress Memory_foam_mattress En_suite_bathroom Outdoor_seating
## Yes: 27 Yes: 36 Yes: 44 Yes: 35
## No :7337 No :7328 No :7320 No :7329
##
##
##
##
##
## Mudroom Full_kitchen Paid_parking_on_premises Bedroom_comforts
## Yes: 9 Yes: 184 Yes:1377 Yes: 160
## No :7355 No :7180 No :5987 No :7204
##
##
##
##
##
## Bathroom_essentials Fixed_grab_bars_for_shower Shower_chair days_since_host
## Yes: 161 Yes: 7 Yes: 3 Min. :2099
## No :7203 No :7357 No :7361 1st Qu.:2476
## Median :2744
## Mean :2790
## 3rd Qu.:3070
## Max. :4211
##
## days_since_first_review days_since_last_review
## Min. : 106 Min. : 106.0
## 1st Qu.: 894 1st Qu.: 204.0
## Median :1526 Median : 221.0
## Mean :1510 Mean : 352.1
## 3rd Qu.:2031 3rd Qu.: 339.0
## Max. :3733 Max. :3385.0
##
Now I check the distribution of price, the variable that I will try to predict. I plot the log of price, which compresses the long right tail and makes the distribution easier to inspect. The shape looks roughly Gaussian, though slightly left-skewed. The average price is $101.5, the minimum $10 and the maximum about $1,000 (all prices are in US dollars).
ggplot(data = airbnb_data, aes(x = log(price))) +
geom_histogram(binwidth = 0.5, color = "black", fill = "skyblue") +
ggtitle("AirBnb Price per Night (Log scale)") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(hjust = 0.5, size = 12),
axis.text = element_text(size = 12)) +
xlab("Log of Price per Night") + ylab("Number of apartments")

In this section I will be exploring some categorical variables, as well as their relationship with price, the variable I intend to predict. First I aggregate the data to compute the average night price by neighbourhood. Then I remove the first row, since its average price of $800 per night is clearly an outlier.
night_neighbourhood <- airbnb_data %>% group_by(neighbourhood) %>% summarize(avg_night_price = mean(price)) %>% arrange(desc(avg_night_price))
night_neighbourhood <- night_neighbourhood[-1, ] # first row is clearly an outlier
Now I plot the average price per night for the 10 most expensive (top) and the 10 cheapest (bottom) neighbourhoods in Barcelona.
ggplot(night_neighbourhood[1:10, ], aes(x = reorder(neighbourhood, avg_night_price), y = avg_night_price,
fill = avg_night_price)) +
geom_bar(stat = "identity") + ggtitle("Top 10 most expensive neighbourhoods") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(hjust = 0.5, size = 12),
axis.text = element_text(size = 12)) + scale_y_continuous(labels = function(x) paste0("$", x)) +
xlab("") + ylab("Airbnb Price per Night") + coord_flip() + theme(legend.position = "none") + scale_fill_gradient(low = "yellow", high = "red")

ggplot(night_neighbourhood[(nrow(night_neighbourhood) - 9) : nrow(night_neighbourhood), ], aes(x = reorder(neighbourhood, avg_night_price), y = avg_night_price,
fill = avg_night_price)) + # last 10 rows = the 10 cheapest neighbourhoods
geom_bar(stat = "identity") + ggtitle("Top 10 cheapest neighbourhoods") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(hjust = 0.5, size = 12),
axis.text = element_text(size = 12)) + scale_y_continuous(labels = function(x) paste0("$", x)) +
xlab("Neighbourhood") + ylab("Airbnb Price per Night") + coord_flip() + theme(legend.position = "none") + scale_fill_gradient(low = "green", high = "blue")

In order to see all these insights clearly, I create a map using the geospatial data for Barcelona, obtained from the neighbourhoods_geojson dataset. This geospatial data is loaded as a “SpatialPolygonsDataFrame”, a class that combines a regular dataframe with the spatial polygons used to draw maps in R. Here it contains the neighbourhood names along with their polygons:
summary(neighbourhoods_geojson)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 2.05247 2.22922
## y 41.31696 41.46828
## Is projected: FALSE
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Data attributes:
## neighbourhood neighbourhood_group
## Length:75 Length:75
## Class :character Class :character
## Mode :character Mode :character
Now I correct the misspellings and convert the neighbourhood variable in the data slot of neighbourhoods_geojson into a factor. Then I join this data with the night_neighbourhood dataframe I created previously, and use a loop to replace the missing values of the price variable with the mean of the available ones. This way I obtain a dataset with each neighbourhood, its spatial polygons and its average night price.
neighbourhoods_geojson@data$neighbourhood <- as.factor(neighbourhoods_geojson@data$neighbourhood)
levels(neighbourhoods_geojson$neighbourhood) <- c("Baró de Viver" , "Can Baró" ,
"Can Peguera" , "Canyelles",
"Ciutat Meridiana" , "Diagonal Mar i el Front Marítim del Poblenou",
"El Baix Guinardó" , "El Barri Gòtic" ,
"El Besós i el Maresme", "El Bon Pastor" ,
"El Camp d'en Grassot i Gràcia Nova" , "El Camp de l'Arpa del Clot" ,
"El Carmel" , "El Clot" ,
"El Coll" , "El Congrés i els Indians" ,
"El Fort Pienc" , "El Guinardó" ,
"El Parc i la Llacuna del Poblenou" , "El Poble Sec" ,
"El Poblenou" , "El Putxet i el Farró" ,
"El Raval" , "El Turó de la Peira",
"Horta" , "Hostafrancs" ,
"L'Antiga Esquerra de l'Eixample" , "La Barceloneta" ,
"La Bordeta" , "La Clota" ,
"La Dreta de l'Eixample" , "La Font d'en Fargues" ,
"La Font de la Guatlla" , "La Guineueta" ,
"La Marina de Port" , "La Marina del Prat Vermell",
"La Maternitat i Sant Ramon" , "La Nova Esquerra de l'Eixample" ,
"La Prosperitat" , "La Sagrada Família" ,
"La Sagrera" , "La Salut" ,
"La Teixonera" , "La Trinitat Nova" ,
"La Trinitat Vella" , "La Vall d'Hebron" ,
"La Verneda i la Pau" , "La Vila de Gràcia",
"La Vila Olímpica del Poblenou", "Les Corts" ,
"Les Roquetes" , "Les Tres Torres",
"Montbau" , "Navas",
"Pedralbes" , "Porta" ,
"Provençals del Poblenou" , "Sant Andreu" ,
"Sant Antoni" , "Sant Genís dels Agudells" ,
"Sant Gervasi - Galvany" , "Sant Gervasi - la Bonanova" ,
"Sant Martí de Provençals" , "Sant Pere, Santa Caterina i la Ribera",
"Sants" , "Sants - Badal" ,
"Sarrià " , "Torre Baró" ,
"Vallbona" , "Vallcarca i els Penitents" ,
"Vallvidrera, el Tibidabo i les Planes" , "Verdun" ,
"Vilapicina i la Torre Llobeta" )
neighbourhoods_geojson@data <- left_join(neighbourhoods_geojson@data, night_neighbourhood[, 1:2]) %>% as.data.frame()
for(i in 1:nrow(neighbourhoods_geojson@data)){
  if(is.na(neighbourhoods_geojson@data[i, "avg_night_price"])){
    neighbourhoods_geojson@data[i, 3] <- mean(neighbourhoods_geojson@data[complete.cases(neighbourhoods_geojson@data), 3])
  }
}
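As a side note, the same NA replacement can be done without a loop. A minimal sketch on a toy vector (not the real dataset):

```r
# Toy vector with missing values
v <- c(10, NA, 30, NA)

# Replace every NA with the mean of the non-missing entries
v[is.na(v)] <- mean(v, na.rm = TRUE)

v  # 10 20 30 20
```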
I load the leaflet package to create an interactive map. This map shows the Average Airbnb Night Price by Neighbourhood in Barcelona.
library(leaflet)
pal <- colorNumeric(
palette = "YlGnBu",
domain = neighbourhoods_geojson@data$avg_night_price
)
price_per_neighbourhood <- leaflet(neighbourhoods_geojson) %>%
addTiles() %>% setView(lng = 2.1734, lat = 41.3851, zoom = 11.5) %>%
addPolygons(stroke = TRUE, fillColor = ~ pal(avg_night_price), fillOpacity = 0.8,
highlight = highlightOptions(weight = 2,
color = ~ pal(avg_night_price),
fillOpacity = 1,
bringToFront = TRUE),
label = ~neighbourhood,
smoothFactor = 0.2,
popup = ~ paste(paste(neighbourhood,":"), "<br/>","<b/>", paste("Avg Night Price: ", "$", round(avg_night_price)))) %>%
addLegend("bottomright", pal = pal, values = ~avg_night_price, opacity = 1.0,
title = "Average Airbnb Night Price",
labFormat = labelFormat(prefix = "$"), na.label="")
price_per_neighbourhood
As we can see, the most expensive neighbourhoods of Barcelona seem to be in the northwest and center of the city, while the cheapest ones are mostly located in the northeast and south. This may be due to the presence of tourist attractions, mainly in the center (Plaça Catalunya, Portal de l’Àngel, Passeig de Gràcia) and others such as the Sagrada Família, where the average night price is $125.
Now I create another map, this time based on the type of property:
pal3 <- colorFactor(palette = c(
"dodgerblue2", "#E31A1C",
"green4",
"#6A3D9A",
"#FF7F00",
"black", "gold1",
"skyblue2", "#FB9A99",
"palegreen2",
"#CAB2D6",
"#FDBF6F",
"gray70", "khaki2",
"maroon", "orchid1", "deeppink1", "blue1", "steelblue4",
"darkturquoise", "green1"),
domain = airbnb_data$property_type)
typeofproperty <- list()
for (i in 1:length(levels(as.factor(airbnb_data$property_type)))) {
typeofproperty[[i]] <- airbnb_data %>% dplyr::filter(property_type == levels(as.factor(airbnb_data$property_type))[i])
}
names(typeofproperty) <- levels(as.factor(airbnb_data$property_type))
typeofproperty_map <- leaflet() %>% addTiles() %>% setView(lng = 2.1734, lat = 41.3851, zoom = 13)
for (i in 1:length(levels(as.factor(airbnb_data$property_type)))) {
typeofproperty_map <- typeofproperty_map %>% addCircles(data = typeofproperty[[i]], lat = ~latitude,
lng = ~longitude, color = ~pal3(property_type),
fillOpacity = 1, label = ~property_type,
popup = ~price, group = levels(as.factor(airbnb_data$property_type))[i])
}
typeofproperty_map <- typeofproperty_map %>% addLegend(data = airbnb_data, "topleft",
pal = pal3, values = ~property_type, title = "Property Type",
opacity = 1, group = "Legend")
groups <- c("Legend", levels(as.factor(airbnb_data$property_type)))
typeofproperty_map <- typeofproperty_map %>% addLayersControl(overlayGroups = groups,
options = layersControlOptions(collapsed = TRUE))
typeofproperty_map
Apartments are the most common Airbnb property type and are spread all over the city; condominiums, lofts and aparthotels are also fairly widespread.
Now I check the correlations between price and the other numerical variables.
library(corrr)
numeric_correlations <- select_if(airbnb_data, is.numeric) %>% correlate() %>% focus(price)
numeric_correlations_plot <- ggplot(data = numeric_correlations, aes(x = reorder(rowname, price), y = price)) +
geom_bar(aes(fill = price), stat = "identity") +
scale_fill_gradient(low = "blue", high = "red") +
ggtitle("Correlation of price and the other numeric variables") +
theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
axis.title = element_text(hjust = 0.5, size = 14),
axis.text.x = element_text(size = 14, angle = 90),
legend.position = "none",
legend.title = element_blank()) +
xlab("Numeric variable") + ylab("Correlation") + coord_flip()
numeric_correlations_plot
The variables most correlated with price are accommodates, bedrooms, beds and guests_included, which suggests these factors influence the price more than the other variables.
In this problem I am going to try 6 different statistical learning models: Ridge Regression, Lasso Regression, Bagging, Random Forest, Boosting and an Ensemble Model created by averaging the other models. To apply all these models I’ll use a training/testing partition approach: I’ll split the existing data into two halves, training and tuning each model on the first half and evaluating its performance on the second.
set.seed(1)
train <- sample(1:nrow(airbnb_data), nrow(airbnb_data)/2)
test <- -train
Ridge regression is a type of regularized linear regression: it adds a squared (L2) penalty on the coefficients, multiplied by a shrinkage parameter lambda that has to be chosen by cross-validation. It is essentially a regression problem where the goal is to find the coefficients (betas) that minimize a penalized Residual Sum of Squares (see The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman).
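In standard notation (following The Elements of Statistical Learning), the penalized criterion and its matrix form are:

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta}\ \Bigg\{ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2
    + \lambda \sum_{j=1}^{p} \beta_j^{2} \Bigg\}

\text{RSS}(\lambda) = (\mathbf{y} - \mathbf{X}\beta)^{T}(\mathbf{y} - \mathbf{X}\beta) + \lambda\,\beta^{T}\beta,
\qquad
\hat{\beta}^{\text{ridge}} = (\mathbf{X}^{T}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^{T}\mathbf{y}
```

Larger values of lambda shrink the coefficients more strongly towards zero, trading a little bias for a reduction in variance.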
Here I define x as the model matrix (the [, -1] drops the intercept column) and y as the vector of Airbnb prices. Then I perform cross-validation to find the optimal value of lambda.
library(glmnet)
x <- model.matrix(price ~., airbnb_data)[, -1]
y <- airbnb_data$price
cv.out_ridge <- cv.glmnet(x[train, ], y[train], alpha = 0)
cv.out_ridge$lambda.min
## [1] 226.1069
par(mfrow = c(1,1))
plot(cv.out_ridge, main = "10-CV to choose Best Lambda")
Here we can see the different values of lambda (the shrinkage penalty) and their associated cross-validated MSE.
Now we fit a model with the best lambda, make predictions on the test set, and compare them to the test values using the RMSE (Root Mean Squared Error): the square root of the mean of the squared differences between predicted and test values. Below are the best lambda and the RMSE:
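In symbols, the error metric used throughout this comparison is:

```latex
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\hat{y}_i - y_i\big)^2}
```

where \(\hat{y}_i\) is the predicted price and \(y_i\) the observed price of the i-th test apartment.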
bestlam_ridge <- cv.out_ridge$lambda.min
paste("The Best Lambda is: ", bestlam_ridge)
## [1] "The Best Lambda is: 226.106905212365"
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = bestlam_ridge)
ridge.pred <- predict(ridge.mod, s = bestlam_ridge, newx = x[test,])
rmse_ridge <- sqrt(mean((ridge.pred - y[test]) ^ 2 ))
paste("RMSE: ", rmse_ridge)
## [1] "RMSE: 86.2334121237272"
Now we check visually how well the model fits the data, by plotting the test values against the predicted values. There is a visible, though weak, correlation between the two, which is not bad for a starting model.
ridge_prediction <- cbind(ridge.pred, y[test]) %>% as.data.frame()
colnames(ridge_prediction) <- c("Prediction", "Test_value")
ridge_accuracy_plot <- ggplot(data = ridge_prediction, aes(x = Test_value, y = Prediction)) +
geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
ggtitle("Ridge Regression: Prediction Accuracy") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(hjust = 0.5),
legend.position = "none",
legend.title = element_blank()) +
xlab("Test Value") + ylab("Predicted Value")
ridge_accuracy_plot
## Warning: Removed 23 rows containing missing values (geom_point).

Lasso regression is a type of regularized linear regression which includes an absolute-value (L1) penalty, multiplied by a shrinkage parameter lambda that has to be chosen by cross-validation (see The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman).
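In standard notation, the lasso criterion replaces the ridge penalty with an absolute-value one, which can shrink some coefficients exactly to zero:

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\ \Bigg\{ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2
    + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert \Bigg\}
```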
Here I perform cross-validation to find the optimal value of lambda. In the plot we can see that the lambda that minimizes the MSE might be between exp(-2) and exp(0) = 1.
cv.out_lasso <- cv.glmnet(x[train, ], y[train], alpha = 1)
par(mfrow = c(1,1))
plot(cv.out_lasso, main = "10-CV to choose Best Lambda") # shows the lambda value that minimizes the CV error
Here we can see the different values of lambda (the shrinkage penalty) and their related MSE.
Now I fit a lasso regression with the best lambda using the glmnet() function. The model improves on the ridge, with a lower RMSE.
bestlam_lasso <- cv.out_lasso$lambda.min
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = bestlam_lasso)
lasso.pred <- predict(lasso.mod, s = bestlam_lasso, newx = x[test,])
rmse_lasso <- sqrt(mean((lasso.pred - y[test]) ^ 2 ))
paste("The Best Lambda is: ", bestlam_lasso)
## [1] "The Best Lambda is: 1.32431284875504"
Now I plot the test values versus the predicted values, and I observe a similar pattern to the ridge: there is a correlation, but it isn’t very strong yet.
lasso_prediction <- cbind(lasso.pred, y[test]) %>% as.data.frame()
colnames(lasso_prediction) <- c("Prediction", "Test_value")
lasso_accuracy_plot <- ggplot(data = lasso_prediction, aes(x = Test_value, y = Prediction)) +
geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
ggtitle("Lasso Regression: Prediction Accuracy") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(hjust = 0.5),
legend.position = "none",
legend.title = element_blank()) +
xlab("Test Value") + ylab("Predicted Value")
lasso_accuracy_plot

Bagging stands for Bootstrap Aggregation: it averages the predictions of B regression trees, each trained on a different bootstrap sample of the data, with no restriction on the number of predictors considered at each split (see The Elements of Statistical Learning, 2nd Edition by Trevor Hastie, Robert Tibshirani and Jerome Friedman).
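In symbols, the bagged prediction is simply the average of the B bootstrap-trained trees:

```latex
\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)
```

where \(\hat{f}^{*b}\) is the tree fitted to the b-th bootstrap sample. Averaging many high-variance, low-bias trees reduces the variance of the final prediction.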
Before fitting anything, I have to clean the column names, since the function I am going to use doesn’t accept spaces, special characters or numbers at the start of column names.
colnames(airbnb_data)[c(28, 29, 30, 31)] <- c("availability_onemonth", "availability_twomonths", "availability_threemonths", "availability_oneyear")
colnames(airbnb_data)[c(88, 93, 95, 124)] <- c("Childrens_books_and_toys", "Pack_Playtravel_crib", "Childrens_dinnerware", "Toilet")
for(i in 1 : ncol(airbnb_data)){
colnames(airbnb_data)[i] <- gsub("-", "_", colnames(airbnb_data)[i])
}
Tree-based functions in R cannot handle categorical predictors with too many levels (randomForest, for instance, caps them at 53), which is why I’ll have to exclude some columns during this process. The Bagging algorithm grows many regression trees and predicts the average of all of them; here I grow 1,000 trees. In the following chunk I train the model and evaluate the errors.
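A quick way to spot the offending columns is to count the levels of each factor. A minimal sketch on toy data (the column names here are illustrative, not from the real dataset):

```r
# Toy data frame: one factor with few levels, one with too many for randomForest
df <- data.frame(
  small_factor = factor(rep(letters[1:5], 20)),   # 5 levels: fine
  big_factor   = factor(paste0("n", 1:100)),      # 100 levels: too many
  y            = rnorm(100)
)

# Flag factor columns exceeding randomForest's 53-level limit
too_many <- vapply(df, function(col) is.factor(col) && nlevels(col) > 53, logical(1))
names(df)[too_many]  # "big_factor"
```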
library(randomForest)
bagging_airbnb <- randomForest::randomForest(price ~., data = select(airbnb_data, -c(host_neighbourhood, neighbourhood)),
subset = train, mtry = ncol(airbnb_data) - 3, keep.forest = T, ntree = 1000) # mtry = all remaining predictors, i.e. bagging
yhat.bagging <- predict(bagging_airbnb, newdata = airbnb_data[test, ])
rmse_bagging <- sqrt(mean((airbnb_data$price[test] - yhat.bagging)^2))
plot_bag_data <- data.frame(test_value = airbnb_data$price[test], bag.pred = yhat.bagging)
Now I plot the predicted values against the test values, in order to see how well the model is performing.
bag_accuracy_plot <- ggplot(data = plot_bag_data, aes(x = test_value, y = bag.pred, color = test_value)) +
geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
ggtitle("Bagging: Prediction Accuracy") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(hjust = 0.5),
legend.position = "none",
legend.title = element_blank()) +
xlab("Test Value") + ylab("Predicted Value")
bag_accuracy_plot
The model seems pretty good, since there is a clear relationship between predicted and test values, but there is still room for improvement.
Random Forest is essentially the same as Bagging, with the difference that at each split the trees can only consider a random subset of the predictors. I use the same methodology as in the bagging implementation.
rf_airbnb_sqrtpred <- randomForest::randomForest(price ~., data = select(airbnb_data, -c(host_neighbourhood, neighbourhood)),
subset = train, mtry = sqrt(ncol(airbnb_data) - 3), keep.forest = T, ntree = 1000) # mtry = sqrt of the number of predictors
yhat.rf_airbnb_sqrtpred <- predict(rf_airbnb_sqrtpred, newdata = airbnb_data[test, ])
rmse_rf_sqrtpred <- sqrt(mean((airbnb_data$price[test] - yhat.rf_airbnb_sqrtpred)^2))
plot_rfsqrt_data <- data.frame(test_value = airbnb_data$price[test], rfsqrt.pred = yhat.rf_airbnb_sqrtpred)
rfsqrt_accuracy_plot <- ggplot(data = plot_rfsqrt_data, aes(x = test_value, y = rfsqrt.pred)) +
geom_point(color = "purple", alpha = 0.5) + ylim(0, 700) + xlim(0, 700) +
ggtitle("Random Forest: Prediction Accuracy") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(hjust = 0.5),
legend.position = "none",
legend.title = element_blank()) +
xlab("Test Value") + ylab("Predicted Value")
rfsqrt_accuracy_plot
This model looks even more robust than the previous due to the stronger linear relationship between predicted and the real values.
Gradient boosting also combines n regression trees, but sequentially. It starts by predicting the variable with its mean, and then fits each successive tree to the residuals (the difference between price and the current prediction), so that each tree takes into account the errors of the previous ones, creating a very robust model (see The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman for the full algorithm).
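For squared-error loss, the boosting recursion sketched above can be written as:

```latex
F_0(x) = \bar{y}, \qquad
r_i^{(m)} = y_i - F_{m-1}(x_i), \qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \quad m = 1,\dots,M
```

where \(h_m\) is a regression tree fitted to the residuals \(r^{(m)}\) and \(\nu \in (0,1]\) is the shrinkage (learning rate); the final prediction is \(F_M(x)\).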
In order to run this algorithm, I first convert all character columns into factors, since characters cannot be used in the model formula.
I apply the algorithm following a similar procedure to the previous models, and obtain the following plot:
The model seems to be performing worse than the last two, but better than the ridge and the lasso.
I create an Ensemble Model by averaging the Ridge, Lasso, Bagging, Random Forest and Boosting predictions.
It performs very poorly; it is the worst method of all.
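The ensemble code itself is not shown above; a minimal sketch of the averaging idea, on toy prediction vectors (all names and numbers here are illustrative, not from the actual models):

```r
# Toy predictions from three hypothetical models for the same 4 test apartments
pred_a <- c(100, 120, 80, 150)
pred_b <- c(110, 115, 90, 140)
pred_c <- c( 90, 125, 85, 160)

# The ensemble prediction is the row-wise mean of the individual predictions
ensemble_pred <- rowMeans(cbind(pred_a, pred_b, pred_c))
ensemble_pred  # 100 120 85 150
```

Its RMSE would then be computed against the test prices exactly as for the individual models.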
I create a dataframe storing all the RMSE results:
rmse_comparison <- rbind(ridge = round(rmse_ridge),
lasso = round(rmse_lasso),
bagging = round(rmse_bagging),
randomforest = round(rmse_rf_sqrtpred),
gboosting = round(rmse_gboosting),
ensemble = round(ensemble_rmse)) %>% as.data.frame()
colnames(rmse_comparison)[1] <- "RMSE"
rmse_comparison$method <- rownames(rmse_comparison)
rownames(rmse_comparison) <- 1:nrow(rmse_comparison)
rmse_comparison <- rmse_comparison[, c(2, 1)]
We can observe that the best model by far is the Random Forest, followed by Bagging. It is curious that the Random Forest outperformed Boosting, since Boosting is usually considered the more sophisticated learning method. Therefore the chosen model is the Random Forest!

